Walktober

  • Our department held a walkathon in October, where we all competed to see how many steps we could walk each day

Data quality angel

Before the competition started, I looked up the most accurate and cost-effective pedometer

Pedometer = bad

  • It turns out, pedometers are wildly inaccurate

Data quality demons

  • It turns out I was the ONLY person concerned about data quality
  • We conducted a survey after Walktober to see if we could quantify the pedometer error
  • Turns out the measurement error was the least of my worries

Survey response

Quantifiable vs unquantifiable uncertainty

  • I can incorporate pedometer error estimates into our analysis; I CANNOT work with completely falsified data
  • This is the difference between quantifiable and unquantifiable uncertainty
  • We are going to try to quantify the uncertainty that we can quantify
    • “Anything worth doing is worth doing poorly” - G. K. Chesterton

Not an uncommon scenario

Often our data is…

  • Unavailable
    • e.g. anonymised data, measurement error, etc.
  • Non-deterministic
    • e.g. bounded data, estimated values, etc.
  • or Theoretical
    • e.g. estimates based on theory, latent variables, etc.

How many statisticians does it take to visualise a random variable?

  • Even though we usually work with random variables, we are unable to visualise them effectively
  • Our choice of error distribution might change the conclusion of our analysis in unexpected ways
  • Often our solution is to just ignore the inherent uncertainty in our data

The visualisation challenge

  • Our department decides to do a visualisation challenge of the Walktober data

I don’t want to ignore it

  • But I did all that reading about pedometers, so I would like to incorporate that uncertainty

Spot the difference

  • Maps of temperature in Iowa counties
  • I chose two error distributions, can you spot the difference?

How do we include the uncertainty?

Exceedance probability map

  • If you care about the uncertainty, visualise the uncertainty
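As a rough illustration of what an exceedance probability map plots: if each area's estimate is treated as a normal random variable, the mapped value is the probability that it sits above some threshold. The sketch below assumes normal errors, and the county names, estimates and threshold are made up for illustration; they are not the Iowa data from the talk.

```python
from statistics import NormalDist

def exceedance_probability(mean, std_error, threshold):
    """P(X > threshold) for X ~ Normal(mean, std_error**2)."""
    return 1.0 - NormalDist(mu=mean, sigma=std_error).cdf(threshold)

# Hypothetical (mean temperature, standard error) pairs; not data from the talk.
counties = {
    "county_a": (21.0, 1.0),
    "county_b": (21.0, 4.0),
    "county_c": (27.0, 1.0),
}
threshold = 24.0  # assumed threshold of interest

# One probability per county: this is the value the map would colour by.
exceedance = {
    name: exceedance_probability(mean, se, threshold)
    for name, (mean, se) in counties.items()
}
```

Here county_a and county_b share the same estimate but get different probabilities, because only their standard errors differ. The map is showing the uncertainty itself rather than the original estimates.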

A terrible vet

Uncertainty as signal vs noise

  • Uncertainty can play two roles in an analysis
    • Sometimes it is used to hedge or dampen our conclusions on other statistics
    • Sometimes it is a statistic of inference itself
  • A visualisation is a statistic, which means that, just like other statistics, we use it to draw inference
    • If we want to draw inference on uncertainty: visualise uncertainty as signal
    • If it is supposed to hedge our inference from the plot: it is noise
  • An exceedance probability map is fine if we want to draw inference on our uncertainty, but not fine if we were trying to hedge the original plot

Solution: add an axis for uncertainty

  • 2D palette is harder to read
  • Says: “We have a wave pattern, but it is uncertain”

I keep getting scammed

Why doesn’t this work?

  • Uncertainty is not just another variable…
    • It presents an interesting perceptual problem
  • Usually we do not want variables to interfere with each other
    • In uncertainty visualisation, the opposite is true

Uncertainty visualisation for signal suppression

  • Statistical validity translates to perceptual ease
    • The higher the variance on an estimate, the harder that estimate is to extract from the plot

Solution: blend the colours together!

  • Made signal harder to see… but maybe too hard?
  • Still have a 2D colour palette
  • Standard error at which to blend colours is made up
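One simple way to get this kind of behaviour is to fade each estimate's colour toward a shared neutral colour as its standard error grows, so that very uncertain estimates all end up looking the same. The sketch below is an assumption about how such blending could work, not necessarily how the palette in the talk was built. The function name, the grey target and `max_std_error` are all hypothetical, and `max_std_error` is exactly the made-up cut-off the last bullet complains about.

```python
def blend_towards_grey(rgb, std_error, max_std_error):
    """Fade an RGB colour toward mid-grey as the standard error grows.

    At std_error = 0 the colour is unchanged; at or beyond max_std_error
    (an arbitrary, user-chosen cut-off) every estimate becomes the same grey.
    """
    grey = (128, 128, 128)
    # Blend weight in [0, 1]: the share of grey in the final colour.
    weight = min(max(std_error / max_std_error, 0.0), 1.0)
    return tuple(round((1 - weight) * c + weight * g) for c, g in zip(rgb, grey))

certain = blend_towards_grey((255, 0, 0), std_error=0.0, max_std_error=10.0)
partly = blend_towards_grey((255, 0, 0), std_error=2.5, max_std_error=10.0)
hopeless = blend_towards_grey((255, 0, 0), std_error=25.0, max_std_error=10.0)
```

Changing `max_std_error` changes which estimates look trustworthy, which is why an arbitrary choice here is uncomfortable.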

Free yourself from the two variable approach

  • Realistically, we are trying to add information back in that we just shouldn’t have dropped
  • We need a more holistic approach that doesn’t allow us to pick and choose when and how we include uncertainty
  • Uncertainty visualisation doesn’t have units of data, it has units of “random variables”, so we should directly input random variables

Vectorise random variables with distributional

steps_dist                  team     name
N(23679, 4633687)[0,Inf]    iwalk()  A
N(18322, 2774223)[0,Inf]    iwalk()  A
N(24562, 5e+06)[0,Inf]      iwalk()  A
N(26128, 5642050)[0,Inf]    iwalk()  A
N(10238, 866202)[0,Inf]     iwalk()  A
  • It turns out you can input random variables directly
  • These columns are made using distributional
  • They are truncated normal random variables
  • This is some of Mitch’s software; I am not going to explain it because Mitch is going to talk about it immediately after me
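To make "truncated normal random variable" concrete, here is a minimal sketch of drawing values from the five distributions in the table. It is plain Python, not the distributional API. It assumes the second number in each `N(., .)` is a variance and that `[0,Inf]` means the distribution is cut off below zero, since step counts cannot be negative. The function name is hypothetical.

```python
import math
import random

def sample_truncated_normal(mean, variance, lower=0.0, upper=math.inf, rng=random):
    """Draw one value from Normal(mean, variance) restricted to [lower, upper].

    Uses simple rejection: redraw until the value lands inside the bounds.
    This is cheap here because the means sit many standard deviations above 0.
    """
    sd = math.sqrt(variance)
    while True:
        draw = rng.gauss(mean, sd)
        if lower <= draw <= upper:
            return draw

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# (mean, variance) pairs copied from the steps_dist column above.
steps = [
    (23679, 4633687),
    (18322, 2774223),
    (24562, 5e6),
    (26128, 5642050),
    (10238, 866202),
]

# 1000 simulated daily step counts for each row of the table.
draws = [[sample_truncated_normal(m, v, rng=rng) for _ in range(1000)] for m, v in steps]
```

Each row of the table is then a whole distribution that can be sampled from, rather than a single number with the uncertainty thrown away.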

Solution: simulate a sample

  • Made using Vizumap’s pixelmap function
  • Gives the best overall understanding of our random variables
  • Not actually making any top level decisions, just letting the variance from the random variables carry through to the visual system
  • The signal seems harder to read
  • 1D colour palette
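The idea behind a sampled pixel map can be sketched without any mapping software: split each region into pixels, give every pixel its own draw from that region's distribution, and colour each pixel with an ordinary 1D palette. The code below is an illustrative assumption of that idea in plain Python, not Vizumap's implementation. The region names, distributions, palette range and helper names are all hypothetical.

```python
import random

def pixel_draws(mean, sd, n_pixels, rng):
    """One independent draw from Normal(mean, sd**2) per pixel in a region."""
    return [rng.gauss(mean, sd) for _ in range(n_pixels)]

def palette_bin(value, lower, upper, n_bins):
    """Index of the 1D palette bin a value falls in, clamped to the palette range."""
    index = int((value - lower) / (upper - lower) * n_bins)
    return min(max(index, 0), n_bins - 1)

rng = random.Random(2024)  # fixed seed so the sketch is reproducible

# Two hypothetical regions with the same estimate but different uncertainty.
regions = {"precise": (20.5, 0.1), "noisy": (20.5, 4.0)}

# Record which palette colours end up inside each region.
bins_used = {}
for name, (mean, sd) in regions.items():
    draws = pixel_draws(mean, sd, n_pixels=100, rng=rng)
    bins_used[name] = {palette_bin(d, lower=10.0, upper=30.0, n_bins=20) for d in draws}
```

The precise region comes out as a nearly solid block of one colour, while the noisy region is speckled with many colours. No extra decision is made about how to show the uncertainty: the variance alone makes the noisy estimate harder to read.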

But let’s take this one step further…

Universal application in ggdibbler

  • ggdibbler applies this concept to every plot and every aesthetic

Text plot

Spatial pixel map

Bar charts

Raster plots

Contour plots

ggdibbler also ensures your plots have nice statistical properties

  • Statistical properties are what differentiate us from the animals

Visual continuous mapping theorem

Example in geom_tile

ggdibbler guarantees these properties

  • Not the default in ggplot2, you need nested positions

Back to the Walktober example

Have a go yourself

Future Plans

  • Future of the software
    • multivariate distributions and other more complex joint distributions
    • build out the nested position system
    • expand on the scales to accept more object types
  • Unemployment
    • I also need a job (I am holding my software hostage)
    • If you want to give me a job, my email is harriet.m.mason@gmail.com

Acknowledgements

  • My supervisors: Di Cook, Susan Vanderplas, and Sarah Goodwin
  • AEMO Zema Energy Scholarship
  • Australian RTP Stipend
  • Numbat Hackathon (for the Walktober data)
  • Mitch O’Hara-Wild and Cynthia Huang